Abstract



We study the problem of allocating stocks to dark pools. We propose and analyze an optimal approach for allocations, if continuous-valued allocations are allowed. We also propose a modification for the case when only integer-valued allocations are possible. We extend the previous work on this problem (Ganchev et al. 2009) to adversarial scenarios, while also improving on their results in the iid setup. The resulting algorithms are efficient, and perform well in simulations under stochastic and adversarial inputs.



1 Introduction 



In this paper we consider the problem of allocating stocks to dark pools. As described by Ganchev et al. (2009), dark pools are a recent type of stock exchange designed to facilitate large transactions. A key aspect of dark pools is the censored feedback that the trader receives. At every round the trader has a certain number $V^t$ of shares to allocate amongst $K$ different dark pools. Dark pool $i$ trades as many of the allocated shares $v_i^t$ as it can with the available liquidity. The trader only finds out how many of these allocated shares were successfully traded at each dark pool, but not how many would have been traded if more had been allocated.

It is natural to assume that the actions of the trader affect the volume available at all dark pools at later times. Similarly, it seems natural that at a given time, the liquidities available at different venues should be correlated: we would expect counterparties to distribute large trades across many dark pools, simultaneously affecting their liquidity. Furthermore, in a realistic scenario, these variables are governed not only by the trader's actions, but also by the actions of other competing traders, each trying to maximize profits. Since the gain of one trader is at the expense of another, this problem naturally lends itself to an adversarial analysis. Generalizing the setup of Ganchev et al. (2009), we assume that the sequences of volumes and available liquidity at each venue are chosen by an adversary who knows the previous allocations of our algorithm.

We propose an exponentiated gradient (henceforth EG) style algorithm that has an optimal regret guarantee against the best allocation strategy in hindsight. Our algorithm uses a parametrization that allows it to handle the problem of changing constraint sets easily. Through a standard online-to-batch conversion, this also yields a significantly better algorithm in the iid setup studied in Ganchev et al. (2009). However, the EG algorithm has the drawback that it recommends continuous-valued allocations. We describe how the problem of allocating an integral number of shares closely resembles a multi-armed bandit problem. As a result, we use ideas from the Exp3 algorithm for adversarial bandit problems (Auer et al. 2003) to design an algorithm that produces integer-valued allocations and enjoys a regret of order $T^{2/3}$ with high probability. While this regret bound holds in an adversarial setting, it also implies an improvement on Ganchev et al. (2009) in an iid setting. We also study an efficient implementation of our algorithm using the idea of greedy approximations in Hilbert spaces (Jones 1992; Barron 1993).



In the next section we describe the problem setup in more detail and survey previous work. We describe the EG algorithm for continuous allocations and prove its regret bound and optimality in Section 3. In Section 4 we describe the algorithm for integer-valued allocations. Section 4.4 describes an efficient implementation. Finally we present experiments comparing our algorithms with that of Ganchev et al. (2009) using the data simulator described in their paper.



2 Setup and Related Work 



We generalize the setup of Ganchev et al. (2009). A learning algorithm receives a sequence of volumes $V^1, \ldots, V^T$ where $V^t \in \{1, \ldots, V\}$. It has $K$ available venues, amongst which it can allocate up to $V^t$ units at time $t$. The learner chooses an allocation $v_i^t$ for the $i$th venue at time $t$ that satisfies $\sum_{i=1}^K v_i^t \le V^t$. Each venue has a maximum consumption level $s_i^t$. The learner then receives the number of units $r_i^t = \min(v_i^t, s_i^t)$ consumed at venue $i$. We allow the sequence of volumes and maximum consumption levels to be chosen adversarially, i.e. $V^t, s_i^t$ can depend on $\{v_i^1, \ldots, v_i^{t-1}\}_{i=1}^K$. We measure the performance of our learner in terms of its regret

\[ R_T = \max \sum_{t=1}^T \sum_{i=1}^K \left[ \min(v_i^{*t}, s_i^t) - \min(v_i^t, s_i^t) \right], \]

where the outer maximization is over the vector $\mathrm{opt} \in \{1, \ldots, K\}^V$ and

\[ v_i^{*t} = \sum_{v=1}^{V^t} I(\mathrm{opt}_v = i), \]

i.e., we compete against any strategy that chooses a fixed sequence of venues $\mathrm{opt}_1, \ldots, \mathrm{opt}_V$ and always allocates the $v$th unit to venue $\mathrm{opt}_v$.
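The setup above can be made concrete with a small brute-force sketch. It is only an illustration under stated assumptions: venues are indexed from 0, the helper names (`censored_rewards`, `comparator_allocation`, `regret`) are hypothetical, and the comparator search enumerates all $K^V$ unit-to-venue maps, so it is usable only for tiny $K$ and $V$.

```python
from itertools import product

def censored_rewards(alloc, liquidity):
    """Units actually traded at each venue: r_i = min(v_i, s_i)."""
    return [min(v, s) for v, s in zip(alloc, liquidity)]

def comparator_allocation(opt, volume, num_venues):
    """Allocation induced by a fixed map: the v-th unit goes to venue opt[v]."""
    alloc = [0] * num_venues
    for venue in opt[:volume]:  # only the first `volume` units exist this round
        alloc[venue] += 1
    return alloc

def regret(allocs, volumes, liquidities, max_volume):
    """Regret against the best fixed unit-to-venue map (brute force)."""
    num_venues = len(liquidities[0])
    earned = sum(sum(censored_rewards(a, s))
                 for a, s in zip(allocs, liquidities))
    best = max(
        sum(sum(censored_rewards(
                comparator_allocation(opt, vol, num_venues), s))
            for vol, s in zip(volumes, liquidities))
        for opt in product(range(num_venues), repeat=max_volume)
    )
    return best - earned
```

For example, with volumes `[2, 1, 2]` and liquidities `[[1, 1], [0, 1], [2, 0]]`, sending the first unit to venue 1 and the second to venue 0 is the best fixed map in hindsight.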



The work most closely related to ours is Ganchev et al. (2009). In that paper, the authors consider the sequence of volumes $V^1, \ldots, V^T$ and allocation limits $s_i^t$ to be distributed in an iid fashion. They propose an algorithm based on Kaplan-Meier estimators. Their algorithm mimics an optimal allocation strategy by estimating the tail probabilities of $s_i^t$ being larger than a given value. They show that the allocations of their algorithm are $\epsilon$-suboptimal with probability at least $1 - \epsilon$ after seeing sufficiently many samples. Theorem 1 in Ganchev et al. (2009) shows that, if the $s_i^t$ are chosen iid, then the optimal strategy always allocates the $v$th unit to a fixed venue. This justifies our definition of regret in comparison to this class of strategies.

The ideas used in our paper draw on the rich literature on online adversarial learning. The algorithm of Section 3 is based on the classical EG algorithm (Littlestone and Warmuth 1994). When playing integral allocations, we describe how the multi-armed bandits problem is a special case of our problem for $V = 1$. For the general case, we describe an adaptation of the Exp3 algorithm (Auer et al. 2003) for adversarial multi-armed bandits. To provide regret bounds that hold with high probability, we use a variance correction similar to the Exp3.P algorithm (Auer et al. 2003). Our lower bounds use information theoretic techniques, building on Fano's method (Yu 1993). The efficient implementation of our algorithm relies on greedy approximation techniques in Hilbert space (Jones 1992; Barron 1993).



3 Optimal algorithm for fractional allocations 

Although the dark pool problem requires us to allocate an integral number of shares at every venue, we start by studying the simpler case where we can allocate any positive value to every venue, so long as the allocations satisfy $\sum_{i=1}^K v_i^t \le V^t$. We start by noting that the reward function $r_i^t = \min(v_i^t, s_i^t)$ is concave in the allocations $v_i^t$. Maximization of concave functions is well understood, even in an adversarial scenario, through approaches such as online gradient ascent. We note that in this problem, the algorithm has access to the subgradient of the reward function. To see this, we define

\[ g_i^t = \begin{cases} 1 & \text{if } r_i^t = v_i^t, \\ 0 & \text{if } r_i^t < v_i^t. \end{cases} \tag{1} \]

Then it is easy to check that $g_i^t$ can be constructed from the feedback we receive, and it lies in the subgradient set $\partial r_i^t$. Hence, we can run a standard online (sub)gradient ascent algorithm on this sequence of reward functions. However, the allocations $v_i^t$ are chosen from a different set $S_t = \{v^t : \sum_{i=1}^K v_i^t \le V^t\}$ at every round. Using standard online gradient ascent analysis, we can demonstrate a low regret only against a comparator that lies in the intersection of all these constraint sets, $\bigcap_{t=1}^T S_t$. However, the regret guarantee can be rather meaningless if $V^t$ is extremely small at even a single round. Ideally, we would like to compete with an optimal allocation strategy like that of Ganchev et al. (2009). A slightly different parameterization allows us to do exactly that.

Let us define $\Delta_K^V = \{x^1, \ldots, x^V : \sum_{i=1}^K x_i^v = 1 \;\; \forall v \le V\}$ to be the Cartesian product of $V$ simplices, each in $\mathbb{R}^K$. Then we can construct an algorithm for allocations as follows: for each unit $v \in \{1, \ldots, V\}$, we have a distribution over the venues $\{1, \ldots, K\}$ where that unit is allocated. At time $t$, the algorithm plays $v_i^t = \sum_{v=1}^{V^t} x_{t,i}^v$. It is clear that this allocation satisfies the volume constraint.

The comparator is now defined as a fixed point $u \in \Delta_K^V$. We compete with the strategy that plays according to $v_i^{*t} = \sum_{v=1}^{V^t} u_i^v$. Then the best comparator $u$ is equivalent to the best fixed allocation strategy $\mathrm{opt} \in \{1, \ldots, K\}^V$. It is also clear that if we can compete with the best strategy in an adversarial setup, online-to-batch conversion techniques (see Cesa-Bianchi et al. 2001) will give a small expected error in the case where the volumes and maximum consumptions are drawn in an iid fashion.
3.1 Algorithm and upper bound 

An online gradient ascent algorithm for this setup is presented in Algorithm 1.

Algorithm 1 Exponentiated gradient algorithm for continuous-valued allocations to dark pools
Input learning rate $\eta$, bound on volumes $V$.
Initialize $x_{1,i}^v = \frac{1}{K}$ for $v \in \{1, \ldots, V\}$, $i \in \{1, \ldots, K\}$.
for $t = 1, \ldots, T$ do
    Set $v_i^t = \sum_{v=1}^{V^t} x_{t,i}^v$.
    Receive $r_i^t = \min\{v_i^t, s_i^t\}$.
    Set $g_i^t$ as defined in Equation (1).
    Set $g_{t,i}^v = g_i^t$ if $v \le V^t$, $0$ otherwise.
    Update $x_{t+1,i}^v \propto x_{t,i}^v \exp(\eta g_{t,i}^v)$.
end for
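The listing for Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than a reference implementation: the function name `eg_dark_pools` and its argument layout are hypothetical, venues and units are indexed from 0, and the whole liquidity sequence is passed in up front only so that the censored reward $\min(v_i^t, s_i^t)$ can be simulated.

```python
import math

def eg_dark_pools(volumes, liquidities, num_venues, max_volume, eta):
    """Run the exponentiated-gradient updates; return the allocations played."""
    K, V = num_venues, max_volume
    # x[v][i]: probability that unit v is sent to venue i (uniform start).
    x = [[1.0 / K] * K for _ in range(V)]
    played = []
    for vol, s in zip(volumes, liquidities):
        # Fractional allocation: sum the distributions of the first `vol` units.
        alloc = [sum(x[v][i] for v in range(vol)) for i in range(K)]
        played.append(alloc)
        reward = [min(alloc[i], s[i]) for i in range(K)]
        # Censored feedback still reveals the gradient of Equation (1):
        # 1 if everything allocated to the venue was consumed, else 0.
        g = [1.0 if reward[i] >= alloc[i] else 0.0 for i in range(K)]
        for v in range(vol):  # units beyond `vol` have zero gradient
            w = [x[v][i] * math.exp(eta * g[i]) for i in range(K)]
            z = sum(w)
            x[v] = [wi / z for wi in w]
    return played
```

With two venues where only the first has liquidity, the allocations drift toward that venue while always summing to the round's volume.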

It can be shown that the algorithm enjoys the following regret guarantee.
Theorem 1. For any choices of the volumes $V^t \in [0, V]$ and of the maximum consumption levels $s_i^t$, the regret of Algorithm 1 with $\eta = \sqrt{\frac{\ln K}{(e-2)T}}$ over $T$ rounds is $O(V\sqrt{T \ln K})$.

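Proof. The regret is defined as

\[ R_T = \max_{u \in \Delta_K^V} \sum_{t=1}^T \sum_{i=1}^K \min\Big(\sum_{v=1}^{V^t} u_i^v, s_i^t\Big) - \sum_{t=1}^T \sum_{i=1}^K \min\Big(\sum_{v=1}^{V^t} x_{t,i}^v, s_i^t\Big) \le \sum_{t=1}^T \sum_{v=1}^{V^t} (u^v - x_t^v)^\top g_t^v, \]

where the inequality uses the concavity of the reward. Following the proof of Theorem 11.3 from Cesa-Bianchi and Lugosi (2006), we define $\nu_{t,i}^v = \eta g_{t,i}^v - \eta (g_t^v)^\top x_t^v$. Also, we note that the gradient is zero for $v > V^t$. So we can sum over $v$ from 1 to $V$ rather than $V^t$. Then we bound the regret as

\[ \sum_{t=1}^T \sum_{v=1}^V \left[ (u^v - x_t^v)^\top g_t^v - \frac{1}{\eta} \ln\Big(\sum_{i=1}^K x_{t,i}^v \exp(\nu_{t,i}^v)\Big) + \frac{1}{\eta} \ln\Big(\sum_{i=1}^K x_{t,i}^v \exp(\nu_{t,i}^v)\Big) \right]. \]

Some rewriting and simplification gives the bound

\[ \frac{1}{\eta} \sum_{t=1}^T \sum_{v=1}^V \left[ \sum_{i=1}^K u_i^v \ln\frac{\exp(\eta g_{t,i}^v)}{\sum_{j=1}^K x_{t,j}^v \exp(\eta g_{t,j}^v)} + \ln\Big(\sum_{i=1}^K x_{t,i}^v \exp(\nu_{t,i}^v)\Big) \right] \]
\[ \le \frac{1}{\eta} \sum_{v=1}^V \left[ \mathrm{KL}(u^v \| x_1^v) + \sum_{t=1}^T \ln\Big(\sum_{i=1}^K x_{t,i}^v \exp(\nu_{t,i}^v)\Big) \right]. \]

Here, the last line uses the definition of KL-divergence and the fact that the telescoping terms cancel out. Now $g_{t,i}^v \le 1$ so that $\nu_{t,i}^v \le \eta$. If $\eta \le 1$, then it is easy to verify that $\exp(\nu_{t,i}^v) \le 1 + \nu_{t,i}^v + (e-2)(\nu_{t,i}^v)^2$. We also note that $\sum_{i=1}^K x_{t,i}^v \nu_{t,i}^v = 0$.

Also, each of the KL divergence terms in the above display is equal to $\ln K$. This is because the optimal comparator will have a 1 for exactly one venue for each unit $v$. As we choose $x_1^v$ to be uniform over all venues, we get the KL divergence between a vertex of the $K$-simplex and the uniform distribution, which is $\ln K$.

Hence we bound the regret as

\[ R_T \le \frac{1}{\eta} V \ln K + \frac{1}{\eta} \sum_{t=1}^T \sum_{v=1}^V \ln\Big(\sum_{i=1}^K x_{t,i}^v \big(1 + \nu_{t,i}^v + (e-2)(\nu_{t,i}^v)^2\big)\Big) \]
\[ \le \frac{1}{\eta} V \ln K + \frac{1}{\eta} \sum_{t=1}^T \sum_{v=1}^V (e-2)\eta^2 \]
\[ = \frac{1}{\eta} V \ln K + (e-2)\eta V T \]
\[ \le 3V\sqrt{T \ln K}, \]

where the last step follows from setting $\eta = \sqrt{\frac{\ln K}{(e-2)T}}$.

□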
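As a quick sanity check on the last step of the proof, the sketch below evaluates the untuned bound $\frac{1}{\eta} V \ln K + (e-2)\eta V T$ at the stated learning rate and compares it with $3V\sqrt{T \ln K}$. The function names are hypothetical and the numbers are illustrative only.

```python
import math

def eg_learning_rate(T, K):
    """eta = sqrt(ln K / ((e - 2) T)), the tuning used in Theorem 1."""
    return math.sqrt(math.log(K) / ((math.e - 2) * T))

def eg_regret_bound(V, T, K, eta):
    """(1/eta) V ln K + (e - 2) eta V T, the bound before tuning eta."""
    return V * math.log(K) / eta + (math.e - 2) * eta * V * T

V, T, K = 100, 10_000, 10
eta = eg_learning_rate(T, K)
tuned = eg_regret_bound(V, T, K, eta)
# At the tuned eta both terms are equal, giving 2 sqrt(e-2) V sqrt(T ln K).
closed_form = 2 * math.sqrt(math.e - 2) * V * math.sqrt(T * math.log(K))
```

Since $2\sqrt{e-2} \approx 1.7$, the constant 3 in the theorem leaves some slack.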



3.2 Lower bound and minimax optimality 

We will now show that the online exponentiated gradient ascent algorithm in Algorithm 1 has the best regret guarantee possible. We start by noting that a regret bound of $O(\sqrt{T \ln K})$ is known to be optimal for the experts prediction problem (Haussler et al. 1998; Abernethy et al. 2009). Hence we can show the optimality of our algorithm for $V = 1$ by reducing the experts prediction problem to the dark pools problem. Recall that in the experts prediction problem, the algorithm picks an expert from $1, \ldots, K$ according to a probability distribution $p_t$ at round $t$. Then it receives a vector of rewards $\rho_t$ with $\rho_{t,i} \in [0, 1]$, $i = 1, \ldots, K$. In order to describe a reduction, we need to map the allocations of an algorithm for the dark pools problem to the probabilities for experts, and map the rewards of experts to the liquidities at each venue.

We consider a special setting where $V^t = 1$ at all times. Since $V^t = 1$, the allocations of any dark pools algorithm are probabilities: they are non-negative and add to 1. Hence we set $p_{t,i} = v_i^t$. We also set the liquidity $s_i^t = p_{t,i}\rho_{t,i}$. Then the net reward of a dark pools algorithm at round $t$ is:

\[ \sum_{i=1}^K \min(v_i^t, s_i^t) = \sum_{i=1}^K \min(p_{t,i}, p_{t,i}\rho_{t,i}) = \sum_{i=1}^K p_{t,i}\rho_{t,i}, \]

where the last equality follows from the observation that $0 \le \rho_{t,i} \le 1$. Hence the net reward of the dark pools problem is the same as the expected reward in the experts prediction problem. Using the known lower bounds on the optimal regret in experts prediction problems, we get:

\[ \max_{u} \sum_{t=1}^T \sum_{i=1}^K \left[ \min(u_i, s_i^t) - \min(v_i^t, s_i^t) \right] = \Omega(\sqrt{T \ln K}). \]

We also note that the regret in the experts prediction problem scales linearly with the scaling of the rewards. Hence, if the rewards take values in $[0, V]$, then the regret of any algorithm is guaranteed to be $\Omega(V\sqrt{T \ln K})$.

For arbitrary $V$, we again consider the special setting with $V^t$ identically equal to $V$. We would now like to reduce the experts prediction problem where every expert's reward is a value in $[0, V]$. At every round, we receive a vector of allocations $v_i^t$. We set $p_{t,i} = v_i^t / V$. We receive the rewards $\rho_{t,i}$ from the experts problem, and assign the liquidities $s_i^t = p_{t,i}\rho_{t,i} \in [0, V]$. Furthermore,

\[ \min(v_i^t, s_i^t) = V \min\Big(\frac{v_i^t}{V}, \frac{p_{t,i}\rho_{t,i}}{V}\Big) = p_{t,i}\rho_{t,i}. \]

The last step relies on observing that $\rho_{t,i} \le V$ so that $p_{t,i}\rho_{t,i}/V \le p_{t,i}$. Now we can argue that the regrets of the two problems are identical as before. Hence the optimal regret on the dark pools problem is at least $\Omega(V\sqrt{T \ln K})$. As Algorithm 1 gets the same bound up to constant factors in a harder adversarial setting than used in the lower bounds, we conclude that it attains the minimax optimal regret up to constant factors.
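The reduction can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical, `rho` denotes the experts' reward vector with entries in $[0, V]$, and the liquidities are built as $s_i = p_i \rho_i$ with $p_i = v_i / V$, as in the argument above.

```python
def dark_pool_reward(alloc, liquidity):
    """Total units traded: sum over venues of min(v_i, s_i)."""
    return sum(min(v, s) for v, s in zip(alloc, liquidity))

def reduction_liquidity(alloc, rho, V):
    """Liquidities s_i = p_i * rho_i, where p_i = v_i / V."""
    return [(v / V) * r for v, r in zip(alloc, rho)]

def expected_expert_reward(alloc, rho, V):
    """Expected reward of playing expert i with probability p_i = v_i / V."""
    return sum((v / V) * r for v, r in zip(alloc, rho))
```

For instance, with $V = 4$, allocations `[2, 1, 1]` and expert rewards `[4, 2, 0]`, both quantities equal 2.5.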



4 Algorithm for integral allocations 

While the above algorithm is simple and optimal in theory, it is a bit unrealistic as it can recommend that we allocate 1.5 units to a venue, for example. One might choose to naively round the recommendations of the algorithm, but such a rounding would incur an additional approximation error which in general could be as large as $O(T)$. In this section we describe a low regret algorithm that allocates an integral number of units to each venue.

To get some intuition about an algorithm for this scenario, consider the case when $V = 1$. Then the algorithm has to allocate 1 unit to a venue at every round. It receives feedback about the maximum allocation level $s_i^t$ only at the venue where $v_i^t = 1$. This is clearly a reformulation of the classical $K$-armed bandits problem. An adaptation of Algorithm 1 that uses the Exp3 algorithm (Auer et al. 2003) would hence attain a regret bound of $O(\sqrt{TK \ln K})$ for $V = 1$. Contrasting this with the bound of Theorem 1 for $V = 1$, we can easily see that the regret for playing integral allocations can be higher than that of continuous allocations by a factor of up to $\sqrt{K}$. Indeed we will now show a modification of the Exp3 approach that works for arbitrary values of $V$. We will also show a lower bound. The upper bound shows that our algorithm incurs $O(T^{2/3})$ regret in expectation, which does not match the $\Omega(\sqrt{T})$ lower bound. However, it is still a significant improvement on Ganchev et al. (2009), as we will discuss later.



4.1 Algorithm and upper bound 

We need some new notation before describing the algorithm. For a fractional allocation $v_i^t$, we let $f_i^t = \lfloor v_i^t \rfloor$ and $d_i^t = v_i^t - \lfloor v_i^t \rfloor$.

Now suppose we have a strategy that wants to allocate $v_i^t$ units to venue $i$ at time $t$. Suppose that we instead allocate $u_i^t = f_i^t$ units with probability $1 - d_i^t$ and $u_i^t = f_i^t + 1$ units with probability $d_i^t$. Using the fact that the maximum consumption limits are integral too,

\[ \mathbb{E}\min(u_i^t, s_i^t) = d_i^t \min(f_i^t + 1, s_i^t) + (1 - d_i^t)\min(f_i^t, s_i^t) = \begin{cases} s_i^t & \text{if } s_i^t \le f_i^t \\ f_i^t + d_i^t & \text{if } s_i^t \ge f_i^t + 1 \end{cases} = \min(v_i^t, s_i^t). \]

Thus, playing an integral allocation $u_i^t$ according to such a scheme would be unbiased in expectation. Of course we need to ensure that we don't violate the constraint $\sum_{i=1}^K u_i^t \le V^t$ in this process. To do so, we let $\sum_{i=1}^K d_i^t = V^t - \sum_{i=1}^K f_i^t = m$. Then we will use a distribution over subsets of $\{1, \ldots, K\}$ of size $m$ that has the property that the $i$th element gets sampled with probability $d_i^t$. It is clear that if there is such a distribution, then we will have the unbiasedness needed above. It will also ensure feasibility of $u_i^t$ if $v_i^t$ was a feasible allocation. Our next result shows that such a distribution always exists.
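The unbiasedness of the randomized rounding at a single venue is easy to verify numerically. The sketch below (the function name is hypothetical) computes the expected reward of the rounded allocation in closed form; it matches $\min(v, s)$ whenever the liquidity $s$ is an integer, and the identity can fail otherwise.

```python
import math

def expected_rounded_reward(v, s):
    """E[min(u, s)] when u = floor(v) + 1 with probability frac(v),
    and u = floor(v) otherwise."""
    f = math.floor(v)
    d = v - f
    return d * min(f + 1, s) + (1 - d) * min(f, s)
```

For example, with $v = 1.5$ the expected reward is 1.5 against $s = 2$, but only 1.1 against the non-integral $s = 1.2$, which is why integrality of the consumption limits is used in the argument.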

Theorem 2. Let $0 \le d_i^t \le 1$, $\sum_{i=1}^K d_i^t = m$ for $m \ge 1$. Then there is always a distribution over subsets of $\{1, \ldots, K\}$ of size $m$ such that the $i$th element is sampled with probability $d_i^t$.

Proof. The proof is by induction on $K$. For the case $K = 2$, $m = 1$, we sample the first element with probability $d_1^t$. If it is not picked, we pick element 2. It is clear that the marginals are correct, establishing the base case. Let us assume the claim holds up to $K - 1$ for all $m \le K - 1$. Consider the inductive step for some $K, m$. We are given a set of marginals, $0 \le d_i^t \le 1$, $\sum_{i=1}^K d_i^t = m$. We would like a distribution $p$ on subsets of size $m$ of $\{1, \ldots, K\}$ that matches these marginals. We partition these subsets into two groups: those that do and do not contain the first element. We correspondingly partition $p = (p_1, p_2)$. Let $N_1 = \binom{K-1}{m-1}$ and $N_2 = \binom{K-1}{m}$ be the number of subsets in the two cases. Then we want $\sum_{k=1}^{N_1} p_1(k) = d_1^t$ in order to get the right marginal at element 1. Hence, we can write $p_1 = d_1^t q_1$, $p_2 = (1 - d_1^t) q_2$ for some distributions $q_1$ and $q_2$ on $N_1$ and $N_2$ subsets respectively. Now we write

\[ d_i^t = \left( \frac{(m-1) d_1^t}{m - d_1^t} + \frac{m (1 - d_1^t)}{m - d_1^t} \right) d_i^t \tag{2} \]

for $i > 1$. Then

\[ \sum_{i=2}^K \frac{(m-1)\, d_i^t}{m - d_1^t} = m - 1, \qquad \sum_{i=2}^K \frac{m\, d_i^t}{m - d_1^t} = m \tag{3} \]

are marginals on subsets of size $m - 1$ and $m$ respectively of $\{1, \ldots, K - 1\}$, and are in $[0, 1]$ as $\sum_{i=2}^K d_i^t = m - d_1^t$. Hence there exist distributions $q_1$ and $q_2$ that attain these marginals using the inductive hypothesis. We set $p_1 = d_1^t q_1$, $p_2 = (1 - d_1^t) q_2$. Then Equations (2) and (3) together imply that we get the correct marginals for every element. □
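One practical way to draw such a subset is systematic sampling, sketched below. Note that this is a different, standard technique and not the inductive construction used in the proof above: a single uniform offset is laid on the cumulative sums of the marginals, and element $i$ is selected when one of the points $u, u+1, u+2, \ldots$ falls inside its interval. The function name and the `rng` argument are assumptions made for illustration.

```python
import math
import random

def sample_subset_with_marginals(d, rng):
    """Sample a subset of size m = sum(d) whose i-th inclusion
    probability is d[i], assuming 0 <= d[i] <= 1 and integral sum(d)."""
    m = round(sum(d))
    assert abs(sum(d) - m) < 1e-9, "marginals must sum to an integer"
    u = rng.random()  # one uniform offset shared by all elements
    chosen, lo = [], 0.0
    for i, di in enumerate(d):
        # Pin the last endpoint to m so float rounding cannot change the size.
        hi = float(m) if i == len(d) - 1 else lo + di
        # Number of integers k with lo <= u + k < hi (0 or 1 since di <= 1).
        if math.ceil(hi - u) > math.ceil(lo - u):
            chosen.append(i)
        lo = hi
    return chosen
```

Each interval has length $d_i \le 1$, so it captures at most one point, and the $m$ points in $[0, m)$ are each captured by exactly one interval; this gives subsets of size exactly $m$ with the required marginals.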

For any allocation sequence $v^t$, let $p(d^t)$ be the probability distribution over subsets of $\{1, \ldots, K\}$ guaranteed by Theorem 2. For some constant $\gamma \in (0, 1]$, let $\tilde d_{t,i} = (1 - \gamma) d_i^t + \gamma \frac{m}{K}$. Then let $p(\tilde d_t)$ be a distribution over subsets that samples the $i$th venue with probability $\tilde d_{t,i}$. We can construct this by mixing $p(d^t)$, which exists by Theorem 2, with the uniform distribution over subsets of size $m$. Also, we let $V_{t,i} \le V^t$ be the largest index $v_0$ such that $\sum_{v=1}^{v_0} x_{t,i}^v \le f_i^t$. We define a gradient estimator:



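\[ \tilde g_{t,i}^v = \begin{cases} I(s_i^t \ge f_i^t) - \dfrac{I(s_i^t = f_i^t)\, I(u_i^t = \lceil v_i^t \rceil)}{\tilde d_{t,i}} & \text{if } v \le V_{t,i}, \\[2ex] \dfrac{I(s_i^t > v_i^t)\, I(u_i^t = \lceil v_i^t \rceil)}{\tilde d_{t,i}} & \text{if } V_{t,i} + 1 \le v \le V^t. \end{cases} \tag{4} \]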

To see why this gradient estimator is good, we first note that the gradient of the objective function at $v_i^t$ can be written as

\[ g_{t,i}^v = I(s_i^t > v_i^t) = I(s_i^t \ge f_i^t) - I(s_i^t = f_i^t), \]

when $v \le V^t$. Then we can easily show the following useful lemma.

Lemma 1. If an algorithm plays $u_i^t = \lceil v_i^t \rceil$ with probability $\tilde d_{t,i}$ and $u_i^t = f_i^t$ otherwise, then $\tilde g_t$ as described in Equation (4) is an unbiased estimator of the gradient at $(v_1^t, \ldots, v_K^t)$.

An algorithm for playing integer-valued allocations at every round is shown in Algorithm 2.

Algorithm 2 An algorithm for playing integer-valued allocations to the dark pools
Input learning rate $\eta$, threshold $\gamma$, bound on volumes $V$.
Initialize $x_{1,i}^v = \frac{1}{K}$ for $v \in \{1, \ldots, V\}$.
for $t = 1, \ldots, T$ do
    Set $v_i^t = \sum_{v=1}^{V^t} x_{t,i}^v$.
    Let $p(\tilde d_t)$ be the distribution over subsets from Theorem 2.
    Sample a subset of size $m = \sum_{i=1}^K \tilde d_{t,i}$ according to $p(\tilde d_t)$.
    Play $u_i^t = f_i^t + 1$ if $i$ is in the subset sampled, $u_i^t = f_i^t$ otherwise.
    Receive $r_i^t = \min(u_i^t, s_i^t)$.
    Set $\tilde g_{t,i}^v$ as defined in Equation (4).
    Update $x_{t+1,i}^v \propto x_{t,i}^v \exp(\eta \tilde g_{t,i}^v)$.
end for



We can also demonstrate a guarantee on the expected regret of this algorithm.

Theorem 3. Algorithm 2, with $\eta$ and $\gamma$ chosen to optimize the bound derived in the proof, has expected regret over $T$ rounds of $O((VTK)^{2/3}(\ln K)^{1/3})$, where $V$ is the bound on volumes $V^t$ and the volumes and maximum consumption levels $s_i^t$ are chosen by an oblivious adversary.

An oblivious adversary is one that chooses $V^t$ and $s_i^t$ without seeing the algorithm's (random) allocations $u_i^t$. We note that the requirement that the adversary is oblivious can be removed by proving a high probability bound. We will describe a slight modification of Algorithm 2 that enjoys such a guarantee.

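Proof. Since the adversary is oblivious, we can fix a comparator $u \in \Delta_K^V$ ahead of time. For the remainder, we let $\mathbb{E}_t$ denote conditional expectation at time $t$ conditioned on the past moves of the algorithm and the adversary. Then the expected regret is

\[ R_T(u) = \mathbb{E}\left[ \sum_{t=1}^T \sum_{i=1}^K \min\Big(\sum_{v=1}^{V^t} u_i^v, s_i^t\Big) - \sum_{t=1}^T \sum_{i=1}^K \min(u_i^t, s_i^t) \right] \]
\[ \le \mathbb{E}\left[ \sum_{t=1}^T \sum_{i=1}^K \min\Big(\sum_{v=1}^{V^t} u_i^v, s_i^t\Big) - \sum_{t=1}^T \sum_{i=1}^K \min(v_i^t, s_i^t) \right] + \gamma T K. \]

Here, the second step follows from the fact that $u_i^t$ would be unbiased for $v_i^t$ without the $\gamma$ adjustment. However, this adjustment costs us at most $\gamma \sum_{t=1}^T m_t \le \gamma T K$ in terms of expected regret over $T$ rounds. For the first term, it is as if we had played the continuous-valued allocation $v_i^t$ itself. Again using the concavity of our reward function,

\[ R_T(u) \le \mathbb{E} \sum_{t=1}^T \sum_{v=1}^V (u^v - x_t^v)^\top g_t^v + \gamma T K = \mathbb{E} \sum_{t=1}^T \sum_{v=1}^V (u^v - x_t^v)^\top (\mathbb{E}_t \tilde g_t^v) + \gamma T K. \]

Here the last step follows from noting that $\tilde g_t$ is an unbiased estimator of $g_t$ by construction, just like in Exp3 (Auer et al. 2003). Now we note that the algorithm is doing exponentiated gradient descent on the sequence $\tilde g_t$. Hence, we can proceed as in the proof of Theorem 1 to obtain

\[ R_T(u) \le \frac{1}{\eta} V \ln K + \frac{1}{\eta} \mathbb{E} \sum_{t=1}^T \sum_{v=1}^V \ln\Big(\sum_{i=1}^K x_{t,i}^v \exp(\nu_{t,i}^v)\Big) + \gamma T K, \]

where $\nu_{t,i}^v = \eta \tilde g_{t,i}^v - \eta (\tilde g_t^v)^\top x_t^v$ as before. Assuming a choice of $\eta$ such that $\eta \tilde g_{t,i}^v \le 1$, we note again that $\nu_{t,i}^v \le 1$. So we can use the quadratic bound on the exponential again and simplify as before to get

\[ R_T(u) \le \frac{1}{\eta} V \ln K + \frac{1}{\eta} \mathbb{E} \sum_{t=1}^T \sum_{v=1}^V \sum_{i=1}^K x_{t,i}^v (\nu_{t,i}^v)^2 + \gamma T K \]
\[ \le \frac{1}{\eta} V \ln K + \eta\, \mathbb{E} \sum_{t=1}^T \sum_{v=1}^V \sum_{i=1}^K x_{t,i}^v (\tilde g_{t,i}^v)^2 + \gamma T K. \]

Now we can swap the sums over $v$ and $i$ to obtain

\[ R_T(u) \le \frac{1}{\eta} V \ln K + \eta\, \mathbb{E} \sum_{t=1}^T \sum_{i=1}^K \sum_{v=1}^V x_{t,i}^v (\tilde g_{t,i}^v)^2 + \gamma T K \]
\[ = \frac{1}{\eta} V \ln K + \eta\, \mathbb{E} \sum_{t=1}^T \sum_{i=1}^K \left[ \sum_{v=1}^{V_{t,i}} x_{t,i}^v (\tilde g_{t,i}^v)^2 + \sum_{v=V_{t,i}+1}^{V^t} x_{t,i}^v (\tilde g_{t,i}^v)^2 \right] + \gamma T K. \]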
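Now we look at the two gradient terms separately.

\[ \mathbb{E}_t \sum_{v=1}^{V_{t,i}} x_{t,i}^v (\tilde g_{t,i}^v)^2 = \sum_{v=1}^{V_{t,i}} x_{t,i}^v \left[ \tilde d_{t,i} \Big( I(s_i^t \ge f_i^t) - \frac{I(s_i^t = f_i^t)}{\tilde d_{t,i}} \Big)^2 + (1 - \tilde d_{t,i})\, I(s_i^t \ge f_i^t) \right] \le 2 v_i^t + \frac{2 v_i^t K}{\gamma}. \]

Here, we used the fact that $\tilde d_{t,i} \ge \gamma / K$ as $m \ge 1$, and that indicator variables are bounded by 1. Hence

\[ \mathbb{E} \sum_{t=1}^T \sum_{i=1}^K \sum_{v=1}^{V_{t,i}} x_{t,i}^v (\tilde g_{t,i}^v)^2 \le 2TV + \frac{2TVK}{\gamma}, \]

using $\sum_{i=1}^K v_i^t \le V$. Next we examine the second gradient term:

\[ \mathbb{E}_t \sum_{v=V_{t,i}+1}^{V^t} x_{t,i}^v (\tilde g_{t,i}^v)^2 = \sum_{v=V_{t,i}+1}^{V^t} x_{t,i}^v \frac{I(s_i^t > v_i^t)}{\tilde d_{t,i}} \le \frac{d_i^t}{\tilde d_{t,i}} \le 2 \]

if $\gamma \le \frac{1}{2}$.

Hence, $\mathbb{E} \sum_{t=1}^T \sum_{i=1}^K \sum_{v=V_{t,i}+1}^{V^t} x_{t,i}^v (\tilde g_{t,i}^v)^2 \le 2TK$. Substituting the above terms in the bound, we get

\[ R_T(u) \le \frac{1}{\eta} V \ln K + 2\eta \Big( TV + \frac{TVK}{\gamma} + TK \Big) + \gamma T K. \]

Optimizing for $\eta, \gamma$ gives

\[ R_T(u) \le 6 (VTK)^{2/3} (\ln K)^{1/3}. \]

□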

We note that the term responsible for the $O(T^{2/3})$ regret is $I(s_i^t = f_i^t) / \tilde d_{t,i}$. While we assume that this can accumulate at every round in the worst case, it seems unlikely that the liquidity $s_i^t$ will be equal to $f_i^t$ very frequently. In particular, if the $s_i^t$'s are generated by a stochastic process, one can control this probability using the distribution of $s_i^t$ and obtain improved regret bounds.

4.2 Variance correction and High probability bound 

We would like to show that the analysis of the previous section holds not just in expectation but also with high probability. This has two advantages. First, it tells us that on most random choices made by our algorithm, it has a low regret. Further, the high probability guarantee can be easily combined with a union bound to give a regret bound for non-oblivious (adaptive) adversaries as well.

High probability bounds in bandit problems are often tricky because even though the gradient estimator is unbiased, its variance is typically large. Hence, using standard martingale concentration on the estimator directly gives a worse $O(T^{3/4})$ regret bound. To demonstrate a high probability guarantee of $O(T^{2/3})$, we need to make a variance correction to our estimator $\tilde g_t$. We define

9t > i = 9i ' i + Kd^V n 5- (5) 
The high probability analysis makes repeated use of the classical Hoeffding-Azuma inequality as well as a version of Freedman's inequality from Bartlett et al. (2008), which we state for completeness.

Lemma 2 (Hoeffding-Azuma inequality). Let $X_1, \ldots, X_T$ be a martingale difference sequence. Suppose that $|X_t| \le c$ almost surely for all $t \in \{1, \ldots, T\}$. Then for all $\delta > 0$,

\[ P\left( \sum_{t=1}^T X_t > \sqrt{2 T c^2 \ln(1/\delta)} \right) \le \delta. \]



Lemma 3 (Bartlett et al. (2008)). Let $X_1, \ldots, X_T$ be a martingale difference sequence with $|X_t| \le b$. Let

\[ \mathrm{Var}_t X_t = \mathrm{Var}(X_t \mid X_1, \ldots, X_{t-1}). \]

Let $V = \sum_{t=1}^T \mathrm{Var}_t X_t$ be the sum of conditional variances of the $X_t$'s and $\sigma = \sqrt{V}$. Then we have, for any $\delta < 1/e$ and $T \ge 4$,

\[ P\left( \sum_{t=1}^T X_t > 2 \max\{2\sigma, b\sqrt{\ln(1/\delta)}\} \sqrt{\ln(1/\delta)} \right) \le \delta \log_2 T. \]



We will now prove a series of concentration results which will immediately give the desired regret bound when put together. The steps in our analysis closely resemble the technique of Abernethy and Rakhlin (2009). The first concentration lemma shows that the regret of the integral allocations is close to that of their continuous-valued counterparts.
Lemma 4. 



3i : ^min(u l t ,s*) - ^min(u',s*) > V^T\n{K/8) + jT/k) < 5. 



Proof. We apply Lemma [2] to the martingale difference sequence X t = min(u*,s|) — E t min(u', s*). Then 
\X t \ <V.So 

(T T \ 

minK*, si) - ^ E t min( U *, s*) > V^/T\n(l/5) < S. 
t—i t—i ) 

But we note that by construction 

E t min(uf , 4) = dt,i min(/? + 1, a\) + (1 - d M ) min(/*, a\) 
= min(/* + d t ,i,s$) 
<min(/* + 4, S |) + ^. 

The statement of lemma then follows from the above inequality and a union bound over all K venues. □ 
The next step is to show that the terms $\sum_{v=1}^V (u^v - x_t^v)^\top g_t^v$ and $\sum_{v=1}^V (u^v - x_t^v)^\top \hat g_t^v$ are close. We proceed indirectly by first bounding the conditional variances.

Lemma 5. 



Var t {(^-g^) T (u v -x v t )}<5 



K K , v x 2 

, d ti e-f dti 

i=i l ' 1 i=i l ' % 



We now combine this with Freedman's inequality to bound $(g_t^v - \hat g_t^v)^\top (u^v - x_t^v)$.
Lemma 6. 

'TV , ^ 2 



" 9t ) T (u v - x%) > SOjTVy/Hl/S) + 2V ( ^ + 1 ) Hl/6) < 2VS\og 2 T 

Vt=l v =l 
Proof. We define the martingale $X_t = \sum_{v=1}^V (g_t^v - \hat g_t^v)^\top (u^v - x_t^v)$. Then $|X_t| \le V\big(\frac{K}{\gamma} + 1\big)$ by Hölder's inequality. Applying the Hoeffding-Azuma inequality gives the result. □

Finally, we also need to show that the size of the gradient estimator which is controlled in expectation is 
also bounded with high probability. 



Lemma 7. 



P IE E > 2 & + 8 In J) v/2Tln(l/5)) < 5. 

\t=l i=i V 7 / ) 

Proof. We define the martingale X t = J^Li £<=i ) 2 " E tffM ) 2 - Th en 

using the bound on g t , and the 

bound on expectation from proof of Theorem j^J X t < Application of Hoeffding-Azuma inequality 

gives the result. □ 

We are now in a position to prove a high probability bound on the regret of Algorithm 2 when run with the gradient estimator $\hat g_t$ instead of $\tilde g_t$.

Theorem 4. With probability at least 1 - ^, the regret of Algorithm^using the gradient estimator g t against 
oblivious adversaries is O (V(TK) 2 ^ 3 ^ . 

The proof essentially involves putting the lemmas together, along with the full information analysis of the quantity $(u^v - x_t^v)^\top \hat g_t^v$.

Proof. Using Lemma 4, with probability at least $1 - \delta/3$,

t k v* 

t=l i=l u=l 
T K V* 



< E E min (E u ? ' s ') - min ( w '< s ') + \/ 2Tln 3 r+^ T 



t=l i=l v=l 
T V* 



< E E( u " " x t ) T 9 V t + IT + JlTln 3 ^. 



t=l v=l 
Invoking Lemma 6, with probability at least $1 - 2\delta/3$,



T V* 



3K 



t=l v=l 



2V I If + 1 ) V / 2Tln(3/<5) + 307TVVln(l/<S). 



Once again we note that we are doing exponentiated gradient descent on $\hat g_t$, so that we get from the proof of Theorem 1

TV 1 T K V 

E E( u " - **) T ^ - y ln A " + ^ E E E E <i(§t,i) 2 - 

t=l -u=l ' t=l 1=1 u=l 



Using Lemma 7 and setting S = A gives the statement of the theorem on optimizing for 7,77. 



□ 



Note that our regret analysis so far has been against a fixed comparator. When the adversary adapts to the player's sequence, the comparator is random as well and depends on the player's moves. However, the comparator consists of delta vectors for every unit $v$. Hence, there are a total of $K^V$ possible comparators, and we can take a union bound over all of them as well; this increases our regret bound by a factor of $V \ln K$ at most. This gives us the following corollary.



Corollary 1. With probability at least 
d(V 2 (TK) 2 / 3 ). 



the regret of Algorithm against adaptive adversaries is 



Comparison with results of Ganchev et al. (2009): We note that although our results are in the adversarial setup, the same results also apply to iid problems. In particular, using online-to-batch conversion techniques (Cesa-Bianchi et al. 2001), we can show that, after $T$ rounds, with high probability the allocations of our algorithm on each round are within $O(V^2 T^{-1/3} K^{2/3})$ of the optimal allocation. This is a significant improvement on the result of Ganchev et al. (2009): it is straightforward to check that the proof they provide gives a corresponding upper bound no better than $O(T^{-1/4})$. As we shall see, the generalization to adversarial setups leads to improved performance in simulations.



4.3 Lower bound on regret for integral allocations 

As mentioned in the previous section, the problem of $K$-armed bandits is a special case of the dark pools problem with integral allocations. Hence, we would like to leverage the proof techniques from existing lower bounds on the optimal regret in the $K$-armed bandits problem. As before we consider a special case with $V^t = V$ at every round. Following Auer et al. (2003), we construct $K$ different distributions for generating the liquidities $s_i^t$. At each round, the $i$th distribution samples $s_i^t = V$ with probability $(\frac{1}{2} + \epsilon)$ and $s_j^t = V$ with probability $\frac{1}{2}$ for $j \ne i$. We now mimic the proof of Theorem 5.1 in Auer et al. (2003). Let $V_i = \sum_{t=1}^T v_i^t$.

We start with a lemma analogous to Lemma A.1 of Auer et al. (2003). Let $\mathbb{E}_i$ and $\mathbb{E}_{\mathrm{unif}}$ denote expectations with respect to the $i$th distribution and the uniform reward distribution respectively.

Lemma 8. Let $f$ be a function of the reward sequence $r$ taking values in $[0, M]$. Then

\[ \mathbb{E}_i f(r) \le \mathbb{E}_{\mathrm{unif}} f(r) + M \sqrt{2\, \mathbb{E}_{\mathrm{unif}}[V_i] \ln\Big(\frac{1}{1 - 4\epsilon^2}\Big)}. \]

Proof. It is clear from Hölder's inequality and Pinsker's inequality that

\[ \mathbb{E}_i[f(r)] - \mathbb{E}_{\mathrm{unif}}[f(r)] \le M \|P_i - P_{\mathrm{unif}}\|_1 \le M \sqrt{2\, \mathrm{KL}(P_{\mathrm{unif}} \| P_i)}. \]

Now we can proceed as in the proof of Auer et al. (2003):



KLCP^ifHPi) =^KL(P unif (r t |r t _ 1 )||P i (r t ||r i _ 1 ) 



t=i 



E 

t=i 

T 



K 



E P unif(^>0)KL 



11 

2^2 



= E P unifK t >0)KL(i +£ " ] 



VifK 4 >0)KL(^+ e 



1 n 1 
2 +£|l 2 



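As $v_i^t$ is integer valued, $P_{\mathrm{unif}}(v_i^t > 0) \le \mathbb{E}_{\mathrm{unif}}[v_i^t]$. Hence

\[ \mathrm{KL}(P_{\mathrm{unif}} \| P_i) \le \sum_{t=1}^T \mathbb{E}_{\mathrm{unif}}[v_i^t] \ln\Big(\frac{1}{1 - 4\epsilon^2}\Big) = \mathbb{E}_{\mathrm{unif}}[V_i] \ln\Big(\frac{1}{1 - 4\epsilon^2}\Big). \]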



□ 



Using this lemma, we can prove a lower bound on the regret of any algorithm that plays integer valued 
allocations. 

Theorem 5. Any algorithm that plays integer-valued allocations has expected regret that is $\Omega\big(\sqrt{T}(\sqrt{VK} + V\sqrt{\ln K})\big)$.
Proof. The net reward of the algorithm when distribution i is picked is given by 



t=i 

T 

= E 



- i (\ \ 

E ^v t J +(-+e\E l v t i 



3=1 j^i 



TV 



TV 



: E E ^ 



4=1 



eEjVl. 



As in the proof of Theorem 5.1 of Auer et al. (2003), we now apply Lemma 8 to the function $V_i$ of the reward sequence. As $V_i \in [0, TV]$, we get

$$E_i[V_i] \le E_{\text{unif}}[V_i] + TV \sqrt{2\, E_{\text{unif}}[V_i] \ln\left(\frac{1}{1 - 4\epsilon^2}\right)} \le E_{\text{unif}}[V_i] + 2TV\epsilon \sqrt{E_{\text{unif}}[V_i]}.$$

Then

$$\sum_{i=1}^{K} E_i[V_i] \le \sum_{i=1}^{K} E_{\text{unif}}[V_i] + 2TV\epsilon \sum_{i=1}^{K} \sqrt{E_{\text{unif}}[V_i]}.$$

Now $\sum_{i=1}^{K} E_{\text{unif}}[V_i] = TV$. Applying Jensen's inequality to the second term we get

$$\sum_{i=1}^{K} E_i[V_i] \le TV + 2TV\epsilon \sqrt{KTV}.$$



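As the index $i$ was chosen uniformly at random, averaging over this choice gives an expected bound on the reward of

$$\frac{TV}{2} + \epsilon\left(\frac{TV}{K} + 2TV\epsilon\sqrt{\frac{TV}{K}}\right).$$

Noting again that the reward of the optimal comparator is still $\left(\frac{1}{2} + \epsilon\right) TV$, we get that the expected regret is

$$\Omega\left(\epsilon\left(TV - \frac{TV}{K} - 2TV\epsilon\sqrt{\frac{TV}{K}}\right)\right).$$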




Setting $\epsilon$ optimally to $c\sqrt{K/(TV)}$ gives an $\Omega(\sqrt{TVK})$ lower bound. We also note that the lower bound of $\Omega(V\sqrt{T \ln K})$ shown for continuous-valued allocations applies to the integer-valued case as well. Combining the two, we get that the regret is

$$\Omega\left(\max\left\{\sqrt{TVK},\, V\sqrt{T \ln K}\right\}\right) = \Omega\left(\sqrt{T}\left(\sqrt{VK} + V\sqrt{\ln K}\right)\right).$$

□
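To make the choice of $\epsilon$ concrete, the sketch below maximises the quantity $\epsilon\left(TV - TV/K - 2TV\epsilon\sqrt{TV/K}\right)$, which is a concave quadratic in $\epsilon$, and checks that the maximum equals $(1 - 1/K)^2\sqrt{TVK}/8$. The function names and the values of $T$, $V$ and $K$ are illustrative assumptions, and the calculation ignores any restriction on the admissible range of $\epsilon$.

```python
import math


def regret_lower_bound(eps: float, T: int, V: int, K: int) -> float:
    """eps * (TV - TV/K - 2*TV*eps*sqrt(TV/K)), the quantity inside the Omega."""
    tv = T * V
    return eps * (tv - tv / K - 2.0 * tv * eps * math.sqrt(tv / K))


def best_eps(T: int, V: int, K: int) -> float:
    """Maximiser a / (2b) of the concave quadratic a*eps - b*eps**2."""
    return (1.0 - 1.0 / K) / (4.0 * math.sqrt(T * V / K))


# Illustrative values only.
T, V, K = 2000, 200, 5
eps_star = best_eps(T, V, K)
value = regret_lower_bound(eps_star, T, V, K)
closed_form = (1.0 - 1.0 / K) ** 2 * math.sqrt(T * V * K) / 8.0
assert math.isclose(value, closed_form, rel_tol=1e-9)
print(eps_star, value)
```

The maximiser has the form $c\sqrt{K/(TV)}$ with $c = (1 - 1/K)/4$, and the resulting value scales as $\sqrt{TVK}$.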

There is a gap between our lower and upper bounds in this case. We do not know which bound is loose. 
4.4 Efficient sampling for integral allocations 

All that remains to specify in Algorithm 2 is the construction of the distribution $p$ over subsets at every round. Since $p$ is specified only through its marginals, it is not obvious how to sample from it. If $K$ is small, one can use non-negative least squares to find a distribution that has the given marginals. However, once the number of venues $K$ is large, $p$ is a distribution over combinatorially many subsets, for which the least squares solver might be too slow. One way around this is to use the idea of greedy approximations in Hilbert spaces from the classic paper of (Jones 1992). We can greedily and efficiently construct a distribution on subsets which approximately matches the marginals on every element. Exact sampling from the distribution without ever constructing it explicitly is also possible. The explicit algorithms giving the implementations can be found in the full version of the paper.
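As an illustration that exact sampling with prescribed marginals is possible without enumerating subsets, the sketch below uses classical systematic sampling. This is a standard technique that we substitute here for illustration; it is not necessarily the procedure given in the full version of the paper. It assumes the marginals lie in $[0, 1]$ and sum to an integer $k$, and the function name is hypothetical.

```python
import math
import random


def sample_subset(marginals, rng=random):
    """Draw a subset whose inclusion probabilities equal `marginals`.

    Assumes each marginal is in [0, 1] and that they sum to an integer k;
    the returned subset then has exactly k elements.
    """
    u = rng.random()  # a single uniform draw in [0, 1) drives the whole sample
    chosen = []
    cum = 0.0
    for i, f in enumerate(marginals):
        lo, hi = cum, cum + f
        # Element i is picked iff one of the points u, u+1, u+2, ... falls in [lo, hi).
        if math.ceil(hi - u) > math.ceil(lo - u):
            chosen.append(i)
        cum = hi
    return chosen


# Usage: four venues, marginals summing to 2, so every draw selects 2 venues.
rng = random.Random(0)
marginals = [0.5, 0.5, 0.25, 0.75]
counts = [0] * len(marginals)
n_draws = 20000
for _ in range(n_draws):
    for i in sample_subset(marginals, rng):
        counts[i] += 1
print([c / n_draws for c in counts])
```

Each draw costs $O(K)$ time and one random number. The scheme matches the marginals exactly but fixes one particular joint distribution over subsets, which is enough when only the marginals matter.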



5 Experimental results 



We compared four methods experimentally. We refer to Algorithms 1 and 2 as ExpGrad and Exp3 respectively. We also ran the Optimistic Kaplan-Meier estimator based algorithm of (Ganchev et al. 2009), which is called OptKM. Finally, we implemented the parametric maximum likelihood estimation-allocation based algorithm described in (Ganchev et al. 2009) as well, which we call ParML. As we did not have access to real dark pool data, we decided to implement a data simulator similar to (Ganchev et al. 2009).

Figure 1: Cumulative rewards for each algorithm as a function of the number of rounds when run on the parametric model of Ganchev et al. (2009), averaged over 100 trials.




Figure 2: (a) Allocations to the 2 venues and (b) cumulative rewards for the different algorithms. Note the inability of ParML and OptKM to effectively switch between venues when distributions switch. ExpGrad and Exp3 also achieve higher cumulative rewards.

We used a combination of a Zero Bin parameter and power law distribution to generate the $s_t^i$'s while the sequence $V_t$ was kept fixed. Parameters for the Zero Bin and power law were set to lie in the same regimes as the ones observed in the real data of (Ganchev et al. 2009).



We started by generating the data from the parametric model of (Ganchev et al. 2009). We used 48 venues and T = 2000 to match the experiments of (Ganchev et al. 2009). The values of the $s_t^i$'s were sampled iid from Zero Bin + Power law distributions with appropriately chosen parameters. A plot of the resulting cumulative rewards averaged over 100 trial runs can be seen in Figure 1.
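A minimal sketch of such a liquidity generator is given below. The exact form of the model is an assumption on our part: liquidity is zero with probability `zero_bin`, and otherwise follows a power law with exponent `beta` truncated at a hypothetical `max_volume`. The names and parameter values are illustrative and are not those used in our experiments.

```python
import random


def make_liquidity_sampler(zero_bin, beta, max_volume, rng=random):
    """Return a function that draws one liquidity value s.

    Assumed model: s = 0 with probability `zero_bin`; otherwise
    s in {1, ..., max_volume} with P(s) proportional to s ** (-beta).
    """
    support = list(range(1, max_volume + 1))
    weights = [s ** (-beta) for s in support]

    def sample():
        if rng.random() < zero_bin:
            return 0
        return rng.choices(support, weights=weights, k=1)[0]

    return sample


def censored_fill(allocation, liquidity):
    """Shares actually traded: the trader observes only min(v, s)."""
    return min(allocation, liquidity)


# Usage: one venue, five rounds, 50 shares allocated per round.
rng = random.Random(1)
sampler = make_liquidity_sampler(zero_bin=0.3, beta=1.5, max_volume=200, rng=rng)
fills = [censored_fill(50, sampler()) for _ in range(5)]
print(fills)
```

A switching scenario can be simulated by building one sampler per phase with different values of `beta` and changing samplers at the switch point.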

We see that ParML has a slightly superior performance on this data, understandably so, as the data is generated from the specific parametric model that the algorithm is designed for. However, ExpGrad gets net allocations quite close to ParML. Furthermore, both Exp3 and ExpGrad perform far better than OptKM, which is in some sense our true competitor, being a non-parametric approach just like ours.

Next, we study the performance of all four algorithms under a variety of adversarial scenarios. We start 
with a simple setup of two venues. The parameters of the power law initially favor Venue 1 for 12500 rounds, 
and then we switch the power law parameters to favor Venue 2. We study both the cumulative rewards 
as well as the allocations to both venues for each algorithm. Clearly an algorithm will be more robust to 
adversarial perturbations if it can detect this change quickly and switch its allocations accordingly. We show 
the results of this experiment in Figure 2.

Because there are just 2 venues, rounding has a rather negligible effect in this case and both our methods have almost identical performance. Our algorithms ExpGrad and Exp3 switch much faster to the new optimal venue when distributions switch. Consequently, the cumulative reward of both our algorithms also turns out significantly higher, as shown in Figure 2(b).

We wanted to investigate how this behavior changes when the switching involves a larger number of venues. We created another experiment where there are 5 venues and maximum volume V = 200. Venues 1 and 5 oscillate between getting very favorable and unfavorable β values (β is the power law exponent). Other venues also switch, but between less extreme values. Allocations to all 5 venues for each algorithm are shown in Figure 3.

Figure 3: Allocations to the 5 venues for the different algorithms: (a) Exp3, (b) ExpGrad, (c) OptKM, (d) Parametric ML. Note the poor switching of OptKM between venues when distributions switch. ParML completely fails on this problem. Exp3 and ExpGrad correctly identify both long and short range trends (see text).

Figure 4: Cumulative rewards for each algorithm when distributions switch between 5 venues, for V = 200 (left) and V = 400 (right). Note the superior performance of ExpGrad and Exp3.

Once again both Exp3 and ExpGrad identify both the long range trend (favorability of venues 1 and 5 over the others) and the short range trend (favoring venue 1 over 5 in certain phases). There is a gap between Exp3 and ExpGrad this time, however, as rounding does start to play a role with 5 venues. OptKM adapts somewhat, although it still doesn't reach as high an allocation level as Exp3 after switching to a new venue. ParML completely fails to identify this switching. We also studied the behavior of the algorithms as V is scaled on the same problem. Figure 4 plots the cumulative reward of each algorithm for V = 200 and V = 400. It is clear that ExpGrad and Exp3 still comprehensively outperform the others.



In summary, it seems that our algorithms are competitive with those of (Ganchev et al. 2009) when the data is drawn from their parametric model. When their assumptions about iid data are not satisfied, we significantly outperform those algorithms. We note that we have only experimented with oblivious adversaries here. The gulf in performance may be even wider for adaptive adversaries.